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CFD code 
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Department of Mechanical Engineering, 
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Lexington, Kentucky 40506, USA 


Abstract 


A parallelization study designed for ADI-type algorithms is presented using 
the OpenMP specification for shared-memory multiprocessor programming. 
Details of optimizations specifically addressed to cache-based computer ar- 
chitectures are described and performance measurements for the single and 
multiprocessor implementation are summarized. The paper demonstrates 
that optimization of memory access on a cache-based computer architec- 
ture controls the performance of the computational algorithm. A hybrid 
MPI/OpenMP approach is proposed for clusters of shared memory machines 
to further enhance the parallel performance. The method is applied to de- 
velop a new LES/DXS code, named LESTool. A preliminary DXS calculation 
of a fully developed channel flow at a Reynolds number of 180, Re T = 180, 
has shown good agreement with existing data. 
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1 Introduction 


The rapid growth of computer hardware and software has made it possible for 
CFD to evolve into routine design tools for practical engineering applications 
in the 21st century. It is expected that LES and DXS will become standard 
practice in the CFD methodology. However, due to the large computer re- 
sources demanded by LES and DXS in practical engineering applications, 
the speed of paradigm shift not only hinges upon the advancement of new 
computer hardware and software, but also depends on new CFD algorithms 
taking advantage of the new computer hardware and software developments. 

The recent advancement in cache-based Shared Memory Multiprocessor 
(SMP) architectures [4], such as Origin 2000 and HP Exemplar, has provided 
an easy transition of a serial CFD programming style to parallel environ- 
ments. The shared memory approach using OpenMP [6] simplifies the imple- 
mentation of ADI type algorithms compared to the distributed programming- 
model using MPI [5]. With OpenMP, the algorithm can be parallelized along 
lines in the computational domain without specifying data movement, which 
is necessary in a distributed environment. In addition, the OpenMP model 
provides an incremental path to a parallel program. This approach is much 
more efficient than the distributed model, which requires the program’s data 
structures to be explicitly partitioned and hence the entire application must 
be parallelized according to the partitioned data structures. 

The main objective of this paper is to describe the parallel implementation 
of ADI-type Xavier-Stokes solvers on cache-based shared memory parallel 
computers using OpenMP directives. An overview of the implementation of 
ADI-tvpe algorithms is presented and performance results based on the SGI 
Origin 2000 are reported. A number of key features needed to improve the 
speed of the memory access are highlighted. These improvements in memory 
access have led to a high floating point performance in our application code 
(LESTool), which is designed for LES and DXS of turbulent flow using high- 
order discretization schemes. 


2 Governing Equations 

The fluid motion is governed by the time-dependent Xavier-Stokes equations 
for an ideal gas which express the conservation of mass, momentum, and en- 
ergy for a compressible Xewtonian fluid. The equations written in curvilinear 
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coordinates are: 


dQ dF^ m, dF ( _ dCk dG v dG c 

dt d£ dp d( dt; dr] d( 


( 1 ) 


where Q is the vector of the conservative variables multiplied by the 
volume of the computational cell, V, and is defined by 


Q = V(p, pu, pv, pw, pE) T . (2) 

The F’s and G°s denote the convective and viscous fluxes, respectively, 
and the subscripts £, p and C represent the directions of the fluxes. The 
inviscid fluxes are given by 


Ft = 


( f ' Ui \ 
pul \ + pV dt; / d.r 

pvU ^ + pV (Vf / dy 

pwli \ + pV d^/dz 

\ p{ /.' - p/pii \ pVdv di ) 


F v = 


( PU r, \ 

f)uU n +pV drj/dx 
pvU v + pV dp/dy 
pwU v + pV dp/dz 
\ p(E + p/p)U ri - pV dp/ dt ) 


F< = 


( PU< \ 

puU £ + pV d( / dx 
PvUq +pV d(/dy 
PwUq + pV dC,/dz 

\ p(E+p/p)u c - P vdc/dt ) 


( 3 ) 


( 4 ) 


where p is the density, p is the pressure, u, v , w are the cartesian velocity 
components, E = e + K is the total energy per unit volume and equal to the 
sum of the internal energy, e, and the kinetic energy, K = (it 2 + v 2 + w 2 )/2, 
and the IT s are the contravariant velocity components defined by 
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( 6 ) 


The viscous fluxes in the £, // and £ directions are given by 


Gt 


V 


( 0 \ 

T xx d^/dx + r xy d£/dy + r xz d^/dz 

r xy d^/dx + T yy d£/dy + r yz d^/dz 
T X z d^/dx + r yz d£/dy + t zz d^/dz 
\ 9id^/dx + g 2 d^/dy + g 3 d^/dz ) 


Gy 


V 


( ° \ 

Txx dr]/ dx + r xy dr]/ dy + r xz dr]/dz 

T xy dr]/ dx + Tyy dr]/ dy + r yz dr]/dz 
t X z dr]/ dx + t vz dr]/ dy + r zz dr]/ dz 
\ g l dr]/ dx + g 2 dr]/dy + g 3 drj/dz J 


(7) 


( 8 ) 


G q = V 


( ° \ 

Txx d(//dx + Try dC/dy + t xz dC,/dz 

Txy d(/dx + Tyy d(/dy + t vz dC,/dz 
T X z dC/dx + T yz d(/dy + t zz d(/dz 
\ gid(/dx + g 2 d(/dy + g 3 d(/dz J 


where 


9l = UT XX + VT xy + WT XZ - q x , 

g 2 = UT X y + VTyy + WT yZ ~ (]y , 

93 = UT XZ + VT yz + WT ZZ - q z . 


The stress tensor r and the heat flux vector q are defined by 


(9) 


( 10 ) 
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and 
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( 17 ) 

(18) 
(19) 


where fi and k are molecular viscosity and thermal conductivity, respec- 
tively. 

The gradients in the stresses and heat fluxes of any dependent variable, 
<fi, where <p can be u, v, w or T, are computed as described in appendix A. 3. 

The pressure is related to the density and temperature, T, according to 
the equation of state 


P = (7- 1 ){E ~ pK) (20) 

where 7 is the ratio of the specific heats. The coefficient of viscosity, /;,, and 
thermal conductivity, k, are related by the constant Prandl number Pr. 


k = 


fJrCp 

~Pr 


(21) 
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3 Numerical method 

By defining the advancement of the solution from time level n to n + 1 as 


Q n + 1 = Q n + A Q n , 


( 22 ) 


the implicit, second-order, three-point-backward approximation in time 
for the solution of (1) can be expressed as 


3 A Q n 1 A Q 


n — 1 


2 A t 2 A t 


|(F 4 "+‘ - C“+‘)+ 
d 


(23) 


d 

7n(K +1 


G 


71 + 1 \ 


/^(7i+l\ , ^ ( zxn+1 

dri'~ r > ) + aC ^ c 

The inviscid and viscous fluxes F and G are linearized by a Taylor series 
expansion 


F” +1 = F” + A x AQ n (24) 

G n+i = gn _|_ B x AQ n 

where x can be A r/ or A x and B x are the Jacobian matrices for the 
inviscid and viscous fluxes and are defined in the appendices A.l and A. 2. 
By substituting the linearization (24) into equation (23) one yields 


3 A Q n 
2 At 


|(A - Bi+CT + A(A, - B„)AQ” 
+A(,4 c - b ( )AQ" 


(25) 


A R n 


where A7? n is evaluated explicitly from the previous time level, n, accord- 
ing to 


A R n 



G n ( ) + 



+ §-^ F ( - G ") 

1 AQ n_1 
+ 2 At 


(26) 
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By further defining a delta-delta variable A' Q n,m such that 


AQ n ’ m+ i = AQ n ’ m + A'Q n ’ m (27) 

Qn,m+l = Qn +A Qn,m + 1 ( 28 ) 

where m is the counter for the inner iteration and Q n,m+l and AQ n,m+ 1 
become Q n+l and A Q n , respectively when the solution for the inner loop 
converges, equation (25) can be written in delta-delta form as: 

f A >nn,m f) f) 

2 ^ - B,)A'Q^ + -(A v - B v )A'Q n ’ m (29) 

• — (,l v - B c )A'Q n,m = A R n ' m 

dc, 

where A R n,m is exactly the net residual of the equation (23) and can be 
evaluated explicitly by 


A R 


71,771 


3 A Q n ’ m 
2 At 


— (F n 


g: 


d 

*) + —(F 
1 dry ' 


71,771 

T} ) 


d 

\ { T?n,m 


G r ; 


1 AQ n_1 
+ 2 At ' 


(30) 


Equation (29) is similar to the iterative method proposed in [8]. Note 
that, when the inner solution converges, A IF’" 1 = 0 and thus equation (23) 
is satisfied. The exact form of the LHS of (29) is thus not crucial to the 
solution of (23). Hence, higli-order approximations are applied to evaluate 
A R while stable low-order differencing schemes are used to approximate the 
LHS of (29). 

In the RHS (30), a fifth-order upwinding scheme is applied to evaluate 
the interfacial inviseid fluxes, /•’, while the sixth-order central differencing 
scheme is applied to approximate the interfacial viscous fluxes, G . Other 
differencing schemes, such as high-order Pade [3], EXO [9] and B-splining 
[2], were also proposed and are currently under test. A first-order upwind 
differencing approximation and a second-order central differencing scheme 
were applied in the LHS of (29) to ensure numerical stability. 

The resulting equation in finite volume form is given by: 
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[D + 5 ( {A ( - B ( ) + 5 V (A V - B v ) + h c (A c - BQ] A'Q = A R (31) 

where S is the finite volume operator and D = |//At. 

The diagonal matrix D can be extracted from equation (31) 


D[I + D~ l 5s(As - Bf.) + D-\{Ar, - B„) (32) 

+T>" 1 (5 C (A C - BQ] A'Q = A R. 

After factoring and reexpressing one obtains 

D[I + D-'S^At - BQ] D~ l D [I + D-'S^A, - B v )] D~ l (33) 

D[I + Z)- 1 h c (A c - B c )] A'Q = A R 

Equation (33) can be written as an ADI algorithm 

[D + S^ -B c )]A'Q l = A R 

[D + 5 V {A V — B v )\ A'Q 2 = DA'Q 1 (34) 

[D + 5 ( (A ( - BQ]A'Q 3 = DA'Q 2 

The solution of equation (34) involves the inversion of 5 x 5 block tridi- 
agonal matrices for all three directions. 

To reduce computational costs, equation (34) can be cast into the diagonal 
form of approximate factorization following the approach of Pulliam and 
Chaussee [7]. 

RJ 1 [D + (A^ - r(BQ)] R^A'Q 1 

R- 1 [D + 5 V { A v - r(BQ)\ R,A'Q 2 
R^ 1 [D + S c (A ( -r(BQ)]R c A'Q 3 

or 

[D + S^-r{BQ)]R ( .A , Q 1 = R^AR 

[D + S V (A V - r(BQ)} R V A'Q 2 = R.DA'Q 1 (36) 

[D i A(_V r(B L ))\R,A'(f = R ( DA'Q 2 


= A R 

= DA'Q 1 (35) 

= DA'Q 2 
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where the A’s and R ' s are the eigenvalue and eigenvector matrices of 
the convective Jacobian, respectively, defined in the appendix by equation 
(38) and (42). The viscous Jacobian matrices are not diagonalized with the 
same eigenvector matrices used for the inviscid Jacobian. Therefore, they are 
approximated by their spectral radii, function r(B) = max(diagonal(B )) , 
where B is defined in the appendix A. 2. The advantage of (35) is that it 
involves only the inversion of three scalar tridiagonal matrices, compared to 
the original form, (34), which requires the inversion of three block tridiagonal 
matrices. 

In contrast to the diagonal ADI method, (35), the block ADI method 
treats the viscous Jacobian matrices in a more exact manner and the method 
allows the use of implicit boundary conditions. As a result, the solution of 
(29) based on the block ADI method, (34), is more stable and the method 
allows a larger time step to be used in the calculations. On the other hand, 
the diagonal ADI method, (35), reduces the computational effort per timestep 
by a factor of 5. Therefore, the current code allows a combination of the 
diagonal and block ADI in different directions. For example, if the wall is 
normal to the //-direction, one can apply the block ADI only in the //-direction 
while the diagonal ADI is used in the other two directions. 

4 Parallel Implementation 

4.1 Algorithmic Design and Memory Management 

Since the operators S v , and 5^ are one-dimensional operators and only de- 
pendent on variables in their corresponding directions, the solution algorithm 
for the right-hand side, as well as the AD I- type left-hand side, is designed 
in such a way that data along the i, j and ^-directions (corresponding to 
the (, // and ('-directions, respectively) is first copied into one-dimensional 
scratch arrays. Then a directional independent module is called to perform 
the operations using the information provided by these one-dimensional ar- 
rays. Once the operations are completed, the resulting data are copied back 
into the three-dimensional arrays. Figure 1 shows the copying of the data 
into scratch arrays in the i and //’-directions. This implementation not only 
provides a simplification in the programming style but also ensures a contin- 
uous data flow when evaluating the LHS and RHS operators. This results 
in primary- and secondary-cache hit rates of 98% and 97%, respectively, on 
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a single- processor Origin 2000 using 120 3 grid points. The array assignment 
from 3-D to 1-D arrays and vice-versa can be easily implemented using the 
new Fortran 90 array-section feature to be discussed in the next section. 



Figure 1: Schematic data transfer of 3D arrays into ID scratch arrays 

The data management illustrated in figure 1 ensures high cache-hit rates 
but the serial performance on the Origin 2000 for a 64 3 test case showed only 
60 MFLOP/s (peak 380 MFLOP/s). Even though the average performance 
of applications on XAS computers is reported to be 40-50 MFLOP/s, we con- 
sidered this number unsatisfactory for performing large-scale DXS and LES 
simulations. To increase the memory bandwidth, all arithmetic operations 
involving the 5x5 block matrices, used in block tridiagonal and periodic block 
tridiagonal solvers, were unrolled. This resulted in a much longer and less 
desirable code, but the performance results outweigh this disadvantage. As 
shown in table 1, the unrolling of all the 5x5 matrix operations gives rise to 
a nearly tripled memory bandwidth and a doubled bandwidth for the LI and 
L2 caches, respectively. As a result, the MFLOP rate on a single processor 
increased to 120, a number we consider a satisfactory performance on the 
Origin 2000. 

4.2 Benefits of Using Fortran 90 

Fortran 90 was chosen as the programming language for the LESTool code. 
A number of new language features compared to Fortran 77 are found to 
be very useful for the development of a efficient, portable and maintainable 
program. Some of these features are highlighted below. 
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Table 1: Optimization on a single processor (64 3 grid points) 



no unrolling 

unrolling 

MFLOP/S 

60 

119 

CPU-time (s) 

703 

314 

primary cache hit rate 

0.985 

0.961 

secondary cache hit rate 

0.967 

0.947 

memory bandwidth used (MB/s) 

9.9 

28.49 

LI - L2 bandwidth used (MB/s) 

80.07 

157.84 


1. Control of precision: Because the current code is intended for 3-D 
multi-block general coordinates using high-order schemes, the numer- 
ical precision of program variables is important to reduce the main 
memory usage of the code. For example, high precision is needed for 
field quantities while lower precision can be used for the geometric 
quantities, such as surface area vectors. Moreover, for the weighting 
of the high-order interpolation functions, the use of a short integer, 
which provides a precision up to 4 significant digits, may be sufficient. 
Fortran 90 provides a convenient and portable way for such a precision 
control, as shown below. 

MODULE kind_spec_module 
IMPLICIT NONE 

INTEGER, PARAMETER :: high = SELECTED_REAL_KIND (15 , 307) 
INTEGER, PARAMETER : : low = SELECTED_REAL_KIND (6,32) 
INTEGER, PARAMETER :: short = SELECTED_INT_KIND (4) 
INTEGER, PARAMETER :: long - SELECTED_INT_KIND (9) 

END MODULE kind_spec_module 

2. Modules: As already shown in the previous example, the module syntax 
provides a mean to package global parameters, derived types and asso- 
ciated operations, and to access them wherever needed. This enables 
a more maintainable programming style. 

3. Dynamic memory control: Memory may be allocated and deallocated 
on demand in Fortran 90. This results in a more flexible implementa- 
tion that will respond to changes of grid size, as shown in the example 
below. 
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REAL (HIGH), DIMENSION :), ALLOCATABLE :: x, y, z 
READ(l) ni, n j , nk 

ALLOCATE (x (ni , nj , nk) , y(ni, n j , nk) , z(ni, n j , nk)) 
DEALLOCATE (x, y, z) 

4. Pointer variables: Pointers will enable the definition of aliases to differ- 
ent memory regions. As can be seen from the example below, all depen- 
dent variables are grouped together with the leading dimension being 
the number of variables. This arrangement is preferred by the cache- 
memory architecture while pointer variables provide precise names to 
access a single variable inside this memory block. 

ALLOCATE (variables (5 , ni, n j , nk)) 
rho => variables(l, :, :, :) 
rhou => variables (2, :, :, :) 

5. Array syntax: As discussed in section 2.1, this feature simplifies the 
programming of the ADI-algorithm, and results in a source code which 
is short and easy to comprehend. For example, the copying opera- 
tion for the ^-direction of the three-dimensional variables to the one- 
dimensional scratch array is depicted below. 

rho_ld(:) = rho_3d(i, j, :) 

6. Derived types: This feature provides a mean to group related data 
together, as shown in the next example. Here we reduce the storage 
needed for the cell-face vectors by just storing the magnitude in a low- 
precision floating point number and the directional cosines for each 
vector component in three, short integers. 

TYPE, public : : storage 
REAL (low) : : magnitude 
INTEGER (short) , DIMENSI0N(3) :: vector 
END TYPE storage 

I 

TYPE (storage) , DIMENSION (ni , n j , nk) :: normal_vector 
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7. Rich set of intrinsic functions: In addition to the powerful array syntax, 
a number of new intrinsic functions, such as TRANSPOSE, MATMUL 
and DOT PRODUCT, provide a convenient way for matrix and vec- 
tor operations. For example, interpolating a variable Of to any order 
involves a dot product of the weighting and variable vectors: 

order 

<t>f =y] H'iOi (37) 

1=1 

In Fortran 90, the interpolation can be expressed very conveniently in 
an order-independent manner. 

rho_interf ace (i) = D0T_PR0DUCT(rho_ld(is : ie) , & 
weighting( : , i)) 

For the multiblock implementation, the new features of Fortran 90, such 
as abstract data types and generic programming, have been tested. This 
programming style enabled us to reveal the weaknesses of the different vendor 
compilers. It should be mentioned that the current version (7.2.1) of the SGI 
compiler has solved all our compiler-related difficulties. 

4.3 Parallel Implementation using OpenMP 

For the parallelization of the ADI-tvpe solver, the recently developed OpenMP 
specification for programming shared-memory multiprocessors is used. SGI 
adopted the OpenMP standard for the ORIGIN series in the version 7.2.1 
compiler and HP has promised that their implementation of the OpenMP 
specification will be released in the upcoming compiler. Because OpenMP 
is a portable and scalable shared-memory multiprocessing application pro- 
gram interface (API) that gives programmers a simple and flexible interface 
for developing parallel applications, it is our belief that it will become the 
equivalent of MPI, the standard for distributed memory programming. 

The LESTool code was parallelized by placing OpenMP directives in the 
outer loop within the LHS and RHS operations. This involved decomposing 
the 3D problem into groups of ID lines, with each group assigned to a ded- 
icated processor. The efficiency of the parallel decomposition was enhanced 
by the use of the “first touch” policy, which is specific to the SGI Origin 2000. 
This implies that the memory allocated is physically placed in the processor 
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node that touches the memory location first. All large three dimensional 
blocks of memory, initialized in planes of constant k, were distributed into 
different- nodes. This allows an easy parallelization for the i- and j-directions. 

After finishing the computation in i- and ^directions, t-lie solution in 
the ^-direction is performed. On typical distributed memory-computers this 
presents a problem because the memory has been distributed in r/- planes 
and therefore no processor can access data along A>lines. In contrast, the 
solution in the ^-direction poses no difficulty on a shared-memory computer. 
In the current approach the outer loop was chosen to be in the ^-direction, 
and the 1-D partition of the i£:-planes was parallelized. 


5 Parallel Performance Results 

The mesh used for this test problem (a direct numerical simulation of a fully 
developed turbulent channel) was 120 3 grid points resulting in a memory 
usage of 1.5 Gbytes. The large arrays from this problem size do not fit- into 
the aggregate of the cache memories and the data is scattered across local 
memories in different processor nodes. The performance on a varying number 
of processors is illustrated by measuring the speedup and MFLOP rate for a 
computation of four time-steps, with 5 inner ADI iterations being performed 
at each time step. 

The results in Fig. 2 show that the speedup scales reasonably well for 
up to 16 processors (approximately 10 times speed up) while a gradual flat- 
tening of the speedup is observed when using 32 processors (approximately 
13 times speedup). The MFLOP rate per processors drops from 112 for a 
single process to 85 for 16 processors and to 75 for 32 processors. Although 
the MFLOP rate is still sufficiently high for 32 processors, the drop of the 
performance is related t-o longer memory access times. Although the memory 
access time can be further reduced by additional tuning of the code, we an- 
ticipate no major breakthrough. Instead, we propose t-o tackle this problem 
by developing a new hybrid MPI/OpenMP approach, which will be described 
in section 7. 
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Figure 2: Speedup and MFLOP/s on the SGI ORIGIN 2000 (120 3 grid 
points) 

6 DNS of fully developed turbulence in a chan- 
nel 

The results presented in this paper were based on the direct numerical sim- 
ulation of a fully developed channel flow reported in [1], The flow has two 
periodic (x and z) directions and the solid wall condition was imposed in 
the y direction. The Reynolds number based on the wall shear velocity, 
u T = VWP, is Re r = p w ii r H/ fi, w = 180, where H is half the channel width. 
The streamwise and spanwise dimensions of the channel are AnH and 2i tH, 
respectively. The computation is carried out on a grid using 200x121x200 
grid points in x, y, and z, respectively. The flow held is initialized with a 
laminar solution and random fluctuations are superimposed on the pressure 
held. The governing equations are then integrated in time until a statistical 
equilibrium is reached ( t,u T /H > 30). 

Figure 3 shows the dimensionless velocity, u + = u/u T , plotted against 
dimensionless wall distance, y + = yu T /v. Also shown in this figure is the 
comparison of the predicted mean velocity profile against the data cited in 
[1] for the same Reynolds number. The dotted line represents the desirable 
profile in the viscous sublayer, u + = y + , and the dashed line denotes the 
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log-law line, u + = ln(?/ + )/0.41 +5.2. The figure shows that the current mean 
velocity profile matches the data of Kim et al. [1] very well. A detailed 
comparison of the DXS statistics of turbulent quantities will be presented 
elsewhere. 



Figure 3: Mean velocity profiles: — , LESTool; o, DXS of [1 1; , viscous 

sublayer; — — — — log layer. 


7 Outlook - Combining MPI/OpenMP 

The future of high-performance computer hardware will continue to evolve 
in the direction of clusters of SMP computers. In this model, SMP comput- 
ing nodes are interconnected by fast, high-speed data links. While OpenMP 
provides a convenient way for parallel programming, MPI is the natural ap- 
proach for distributed computing. In further development of the LESTool 
code, we therefore propose to use a hybrid MPI/OpenMP approach in an 
attempt to combine the best features of the two approaches. The hybrid 
approach provides the flexibility to choose between shared and distributed 
memory computing, or a combination of the two. 

Our ultimate goal is to use the CFD code for the simulation of complex 
geometries by applying the multi-block concept. This strategy is designed to 
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take advantage of the hybrid MPI/OpenMP practice. From the overall pool 
of processors, different groups of processors will be clustered together. These 
clusters of processors will contain one or more grid blocks and communicate 
using MPI. Within each cluster the solver is parallelized using OpenMP. A 
schematic overview of this concept is depicted in figure 4. Testing of the code 
using this hybrid practice is currently underway. 


r 

distributed memory parallel 


s 

shared memory 
parallel 


shared memory 
parallel 

> 

OpenMP 

\ — ^ 

OpenMP 

ilii: 

MPI 

liill 


Figure 4: A concept for a hybrid MPI/OpenMP parallelization 


8 Concluding Remarks 

In this paper the numerical method for a high-order CFD code in curvilinear 
coordinates is described and two variants of the iterative implicit time inte- 
gration scheme are discussed in detail. The DXS results obtained with this 
code show good agreement with results reported in the literature. 

Experience on the shared-memory parallelization of an ADI-tvpe CFD 
code is reported. Data management by loading 3-D arrays into 1-D arrays 
has been found to be very successful for the cache-based architectures. One 
surprising finding is the need to unroll all 5-by-5 matrix operations to achieve 
high MFLOP rates. 

The advantage of Fortran 90 in implementing the current code is clear. 
The excellent new features of Fortran 90 in comparison to Fortran 77 of- 
fer the possibility to create an efficient, portable and maintainable program. 
Although compiler errors from different vendors have been encountered, we 
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found the overall advantages of using this new language outweigh the disad- 
vantages. The performance of the current 7.2.1 version of the SGI compiler 
is very satisfactory. 

The new OpenMP directives are used to parallelize the code on the SGI 
Origin 2000 computer. Open .VI P offers a portable and simple way to use 
directive-based parallelization for the current LESTool code. We believe 
that for shared-memory parallelization OpenMP will be as widely accepted 
as its counterpart MPI for distributed parallel processing. 

The scaling of the current code is fairly satisfactory up to 16 processors. 
However, the parallel efficiency for more than 32 processors is only marginal 
due to excess memory-access time. 

A hybrid MPI/ OpenMP concept that combines the best features of both 
MPI and OpenMP standards is proposed. This practice can be coupled with 
the multi-block strategy of the current CFD code and is believed to offer 
the additional mileage needed to speed up the code for multi-node, multi- 
processor computer architectures. 
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A Appendix 

A.l Euler Flux Jacobians 



Figure 5: Discrete control volume 

The Jacobian Matrix A x , where x denotes £, r; or £ direction, at a control 
volume face shown in figure 5 can be defined as 


A X = ^ = R ~ 1 A X R 

where A x and R are the eigenvalue and eigenvector matrices of the Jacobian, 
respectively. The diagonal matrix A x can be written as 


A = 


/ A X) i 0 0 0 0 \ 

0 A Xj2 0 0 0 

o o’ a X j3 o 0 

0 0 0 a X j4 o 

0 0 0 o' / 


where 


(38) 
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( 39 ) 



The eigenvectors R can be expressed by the product of the three matrices 


R = CNM (42) 

where 


/ -c 2 0 0 0 1 \ 

0 0 10 0 

C= 0 0 0 1 0 (43) 

0 c 0 0 1 

\ 0 -c 001/ 

/ 1 0 0 0 0 \ 

0 Uni H n2 0 

N = 0 n b i n b 2 n b 3 0 (44) 

0 n c i n c2 c c 3 0 

\ 0 0 0 0 1 / 

and 


(l 0 0 0 0 \ 

—u 1 0 00 

M = -v 0 1 0 0 (45) 

—w 00 10 

y (7 — 1 )K — (7— 1 )u — (7 — \)v — (7— 1 )w 7— 1 / 

In equation (44), vectors (n nl , n n2 , n n3 ), (n bl , n b2 , n b3 ) and (n clj n c2 , rp 3 ) 
corresponds to the unit vectors normal to the control volume face, (13 x 24), 
and the units vectors in the 1-3 and 2-4 directions, respectively, as shown in 
figure 5. 
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The transformation R converts the conservative variables into Riemann 
variables 

RSQ = (Sp — c 2 5p , p5Un , p 8 Ut 2, bp + pc5U n , Jp — pcSU n ) T (46) 

where U n is the contravariant velocity normal to the control volume face 
(l3 x 2/1) and and Ut 2 are two tangential velocities in the and 2t direc- 
tions, respectively. 

A. 2 Viscous Flux Jacobians 

Similarly, the Jacobian matrices B x can be computed as 

0 0 0 0 0 \ 

ni22U — m,2zv — m2\w rno2 rri2z 777 2 4 0 

m 2i u - rn-zzv - w. ?A w m 2 3 rn 33 m 34 0 ( 47 ) 

m 2 4U — rn 34 u — 7774477; rn 2 4 77734 77744 0 

bi b 2 b 3 64 m 55 / 


b 2 

h 

64 

and 

774 22 = z/V [4(<9\/ch) 2 /3 + (dx/dy) 2 + (3x/cR) 2 ] 
rn 33 = z/V r [(%/ftc ) 2 + 4(<9x/<9?/) 2 /3 + (Ry/dz) 2 ] 

7,744 = vV[{dx/dx f + ( dx/dy ) 2 + 4(<9y/<9z) 2 /3] 

777.55 = ^[(dy/dx) 2 + (dx/dy) 2 + {dx/dz) 2 ] (49) 

777 23 = vV{dx/dx)(dx/dy)/3 
m 24 = vV{dx/dx){dx/dz )/ 3 
r?7 34 = (dx/dy) (dx/ dz) / 3 


= — 77722 M 2 " 777 33 77 2 — /?/ ; pf 2 + 7?7 55 (/i — C t ,T) 

— 21112ZUV — 2?77 2 4 77777 — 2 777 3 4 77777 
= (?7l 22 - 77755)77 + 7?7 23 77 + 77724777 (48) 

= 777 2 3 77 + (r77 3 3 - 777 55 ) 77 + 7?7 3 4777 
= 7772 4 77 + 777 3 4 77 + (777.44 — m 55 )w 
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A. 3 Coordinate transformation 

Gradients of any variable o are evaluated using the following transformation: 


/ dx/da dxjdb dx/dc \ / do/dx \ / do/ da \ 

I dy/da dy/db dy/dc | dd/dy I = I do/Qb I (50) 
y dz/da dz/db dz/dc J y do/di J y do/dc J 

where a, b and c; are components of the transformed coordinates defined 
by the P-E, 1-3 and 2-4 directions, respectively, as shown in figure 5. 

A. 4 Metric vectors 

The metric vectors V(d^/dx, d^/dy, d^/dz), V(dr]/dx, drj/dy, drj/dz ) and 
V(d(,/dx, dC,/dy , d(,/dz ) are the the surface vectors normal to the control 
volume faces in the £, rj and ( directions, respectively and can be written in 
a general form as (Si, S 2 , S3), which can be defined by the cross product of 
the vector 1)1 and 24 for any control volume face as shown in figure 5 

S = h-(l3x2i) (51) 

Note that S and Pi/ are not necessarily in the same direction in general 
curvilinear coordinates. 
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